On Intra-page and Inter-page Semantic Analysis of Web Pages
Authors
Abstract
To make real Web information more machine processable, this paper presents a new approach to intra-page and inter-page semantic analysis of Web pages. The approach combines Web page structure analysis and semantic clustering for intra-page analysis with machine-learning-based link semantic analysis for inter-page analysis. Using automatic discovery of repetitive patterns at the structural level and clustering at the semantic level, we explore the intra-page semantic structure of Web pages and refine the processing unit from the whole page to a finer granularity, i.e., semantic information blocks within pages. After observing a variety of hyperlinks, we summarize the inter-page semantics of the Web and define a set of hyperlink semantic categories oriented toward information organization. Taking into account the presentation of the hyperlink carrier and the intra-page semantic structure, we propose corresponding feature selection and quantification methods, and then use the C4.5 decision-tree method to classify hyperlink semantic types and analyze the inter-page semantic structure. The experimental results suggest that our approach is feasible for machine processing.
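As a rough illustration of the classification step mentioned in the abstract, the sketch below computes the gain ratio, the split criterion that C4.5 uses to choose which attribute to test at a node. It is not the paper's implementation: the feature names (`carrier`, `block`), the category labels, and the toy data are hypothetical, and a full C4.5 learner would also handle continuous attributes, missing values, and pruning.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of splitting on one categorical feature (C4.5 criterion)."""
    total = len(rows)
    # Group the class labels by the value each row has for this feature.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features with many small partitions.
    split_info = -sum(
        (len(g) / total) * log2(len(g) / total) for g in groups.values()
    )
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical hyperlink features and semantic types, for illustration only.
links = [
    {"carrier": "text", "block": "nav"},
    {"carrier": "text", "block": "nav"},
    {"carrier": "image", "block": "content"},
    {"carrier": "text", "block": "content"},
]
types = ["navigation", "navigation", "reference", "reference"]

# The feature with the highest gain ratio would be tested at the root node.
best = max(["carrier", "block"], key=lambda f: gain_ratio(links, types, f))
print(best)
```

On this toy data the block-level feature separates the two link types perfectly, so it would be chosen first; real hyperlink data would need the quantified features the paper describes.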
Similar resources
Intra/Inter-document Change Awareness for Co-authoring of Web Sites
Systems that support the co-authoring of web sites often allow users to freely edit pages. This can result in semantic inconsistencies within and between pages. We propose a change awareness mechanism that monitors intra- and inter-document edits, taking into account changes made to a page and pages connected to it through HTML or transclusion links. The effect of all the changes is computed base...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...
Hypertext Semantics for Web Applications
Web applications integrate dynamic pages for publishing data, functions for content management, and generic business services. To support the model-driven design and automatic generation of Web applications, an extended notion of hypertext is required, whose semantics is scarcely investigated. In this paper, we analyse and classify the semantic problems encountered in the computation of Web pag...
Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction
Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages generally far outnumber labeled ones. Web page data is commonly high-dimensional. Thus, how to extract useful features from this kind of data ...
Web page classification based on a support vector machine using a weighted vote schema
Traditional information retrieval methods use keywords occurring in documents to determine the class of the documents, but often retrieve unrelated web pages. To classify web pages effectively while addressing the synonymous-keyword problem, we propose a web page classification method based on a support vector machine using a weighted vote schema for various features. The system uses both latent seman...